Modeling biological processes as stopped random walks

EFI and Statistical Ecology Section Webinar

December 2, 2024

Contents

  1. CLT for Stopped Random Walks
  2. Application 1: Experimental Data from Charrier et al. (2011)
  3. Application 2: Observational Data from Marsham (1746-1958)
  4. Non-asymptotic Model for Stopped Random Walks

Stopped Random Walks

Many biological processes can be modeled as stopped random walks

  • e.g., the law of the flowering plants: a flower first blooms in the spring after cumulative temperatures reach a threshold.
    • see Réaumur (1735) and Quetelet (1846) (Auerbach 2023)
    • widely known today as the “growing degree day” model in which temperature exerts a “force” on a plant.
    • explains many processes: leafout, insect emergence, etc.

  • A cumulative sum is a random walk (with drift).
    • The walk stops when the threshold is passed.

Simulation

expand for R code
library("tidyverse")
library("knitr")

set.seed(1)

simulation_data <-
  tibble(temp = rnorm(121, 10, 5),
         date = seq(as.Date("2024-01-01"), as.Date("2024-04-30"), "day"))


bloom_date <-
  simulation_data %>%
  filter(date >= as.Date("2024-02-01")) %>%
  mutate(status = cumsum(temp) > 300) %>%
  filter(status == TRUE) %>%
  summarize(bloom_date = first(date)) %>%
  pull(bloom_date)

simulation_data %>% 
  ggplot() + 
  aes(date, 
      cumsum(ifelse((date >= as.Date("2024-02-01")) &
                      (date <= as.Date(bloom_date)),
                    temp, 
                    0))) + 
  geom_line() +
  theme_bw() +
  labs(x = "", y = "cumulative temperature (°C)") +
  geom_vline(xintercept = as.Date("2024-02-01"), linetype = 2) +
  geom_vline(xintercept = as.Date(bloom_date), linetype = 2) +
  annotate("label", 
           y = c(50, 250), 
           x = c(as.Date("2024-02-01"), as.Date(bloom_date)),
           label = c("last\nfrost", "first\nbloom"))

a simulation of the law of flowering plants: a flower first blooms in the spring after cumulative temperatures reach a threshold

Two assumptions

Let \(X_i > 0\) denote the force exerted on the plant on day \(i\).

  • Assume
    1. \(X_i\) have common mean \(\mu\) (the ambient temperature).
    2. \(X_i\) are independent with common variance \(\sigma^2\).

  • Both assumptions can be relaxed.
    • e.g., 1. can be replaced with the assumption that \(X_i\) have mean \(\mu_i\), and \(\sum \mu_i\) does not grow too quickly (regularly varying with index 1).

CLT for stopped random walks

Let \(S_{a}^i = \sum_{j=a}^i X_j\) denote the cumulative force from day \(a\) to \(i\)

Let \(n_{\gamma}\) denote the bloom date.

  • Assume the plant blooms when \(S_a^i\) first passes \(\gamma\),

\[n_{\gamma} = \min \{ i : S_a^i \geq \gamma \}\] When \(\gamma\) is large,

\[n_{\gamma} \ \dot \sim \ \text{Normal} \left ( \gamma / \mu + a, \, \gamma \sigma^2 / \mu^3 \right)\]

Simulation (\(a = 32\), \(\gamma = 300\))

expand for R code
simulate_data <- 
  function(mean_temp)
  tibble(temp = rnorm(121, mean_temp, 5),
         date = seq(as.Date("2024-01-01"),
                    as.Date("2024-04-30"),
                    "day")) %>%
  filter(date >= as.Date("2024-02-01")) %>% 
  mutate(status = cumsum(temp) > 300) %>%
  filter(status == TRUE) %>%
  summarize(bloom_date = first(date)) %>%
  mutate(mean_temp = mean_temp)

simulation_data <-
lapply(rep(c(5, 10, 15, 20, 25), 20), 
       simulate_data) %>%
  Reduce(bind_rows, x = .)

simulation_data %>%  
  head(2) %>%
  kable(align = "c")
bloom_date mean_temp
2024-03-25 5
2024-02-28 10
expand for R code
simulation_data %>%
  mutate(doy = as.numeric(format(bloom_date, "%j"))) %>%
  lm(formula = doy ~ I(1/mean_temp),
     weight = mean_temp^3, 
     data = .) %>%
  broom::tidy() %>%
  mutate(term = c("$a$", "$\\gamma$")) %>%
  kable(align = "c", 
        digits = 2)
term estimate std.error statistic p.value
\(a\) 31.62 0.37 86.36 0
\(\gamma\) 297.13 7.09 41.91 0

Three takeaways from CLT

Recall bloom date \(n_{\gamma} \ \dot \sim \ \text{Normal} \left ( \gamma / \mu + a, \, \gamma \sigma^2 / \mu^3 \right)\). This imples:

  1. \(\mathbb{E}[n_{\gamma}-a] \propto 1/\mu\) and variance \(\text{Var}\left (n_{\gamma} \right) \propto 1/\mu^{3}\)

  2. \(\gamma\) depends on the temperature scale (°C, °F, etc.) and is only identified if \(\mu\) is known

  3. When \(a\) unknown, we can count from any \(b > 0\) since for \(\delta = a - b\),

\(n_{\gamma} + \delta \ \dot \sim \ \text{Normal} \left ( \gamma / \mu + \delta, \, \gamma \sigma^2 / \mu^3 \right)\)

Experimental Data

Experiment (Charrier et al. 2011)

  • 30 walnut trees (Juglans sp.), 6 genotypes at 2 locations.
    • Stems sampled in November from each tree and cut in 7-cm-long pieces with only one bud.
    • Stems chilled (4°C) and then forced (warmed) at different temperatures (5, 10, 15, 20 and 25°C).

  • We examine the relationship between:
    • forcetemp: temperature during forcing
    • response.time: (day of budburst) when 50% of buds unfolded (stage 15 of BBCH scale)

Experiment (Charrier et al. 2011)

expand for R code
charrier11 <- 
  read_csv("data/dwsl.csv") %>%
  filter(datasetID == "charrier11", 
         study == "exp2")

charrier11 %>%
  select(forcetemp, response.time) %>%
  head(5) %>%
  kable(align = "c")
forcetemp response.time
5 175.88
10 97.06
15 31.76
20 24.71
25 22.94

data from Charrier et al. (2011)

Experiment (Charrier et al. 2011)

expand for R code
(figure_charrier <-
  charrier11 %>%
  ggplot() +
  aes(forceday, response.time) +
  geom_point() +
  theme_bw() +
  theme(axis.line = element_line(linewidth = 0.5, colour = "darkgray")) +
  labs(x = "Temperature (°C)",
       y =  "Mean Time until Budburst (days)") +
  ylim(0, 250))

data from Charrier et al. (2011)

Fit using lm

expand R code
figure_charrier +
  geom_smooth(aes(weight = forceday^3), 
              method = "lm", formula = y ~ I(1/x), color = "red")

data from Charrier et al. (2011)

Fit using glm

expand for R code
figure_charrier +
  geom_smooth(method = "glm", formula = y ~ I(1/x),
              method.args = list(family = inverse.gaussian(link = "identity")))

data from Charrier et al. (2011)

Fit using Stan

expand for Stan code
data {
  int<lower=1> n;
  int<lower=1> m_lower;
  int<lower=m_lower> m_upper;
  vector<lower=0>[n] mu;
  vector<lower=0>[n] y;
}
parameters {
  real<lower=0> gamma;
  real<lower=0> sigma;
  real delta;
}
model {
  y ~ normal(gamma / mu + delta, gamma * pow(sigma, 2) / pow(mu, 3));
}
generated quantities {
  vector[m_upper - m_lower + 1] y_pred;
  for(i in m_lower:m_upper) {
    y_pred[i - m_lower + 1] = gamma / i + delta;
  }
}
expand for R code
library("rstan")
options(mc.cores = parallel::detectCores())

# stan_mod1 <- stan_model("stan_code/stan_mod1.stan")
fit1 <- 
  sampling(stan_mod1,
           data = list(n = nrow(charrier11),
                       m_lower = 5,
                       m_upper = 25,
                       mu = charrier11$forceday,
                       y = charrier11$response.time),
           control = list(adapt_delta = 0.95),
           refresh = 0)

figure_charrier +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), fill = "grey", alpha = .5,
              data = 
                tibble(forceday = 5:25,
                       response.time =  colMeans(extract(fit1, "y_pred")[[1]]),
                       ymin = apply(extract(fit1, "y_pred")[[1]], 2, quantile, prob = .025),
                       ymax = apply(extract(fit1, "y_pred")[[1]], 2, quantile, prob = .975))) +
  geom_line(data = 
              tibble(forceday = 5:25,
                     response.time =  colMeans(extract(fit1, "y_pred")[[1]]),),
            color = "orange", linewidth = 1)

data from Charrier et al. (2011)

Observational Data

Observations (Marsham 1746-1958)

  • Robert Marsham recorded the first occurrence of 27 signs of spring each year.
    • Marsham family continued to collect the data after Robert’s death in 1979

  • We model the day leaves first appeared on his oak trees.
    • spring.temp: cumulative daily temperature (2/1 to 4/30)
    • response.time: number of days from January 1 to budburst

  • The first day of temperature accumulation, \(a\), may vary by year.

Observations (Marsham 1746-1958)

expand for R code
marsham <- 
  read_csv("data/Marsham-Combes_UK.csv")  %>%
  select(year, response.time = oak) %>%
  na.omit()

temp <- 
  read_csv("data/meantemp_daily_totals.csv") %>%
  mutate(date  = seq(as.Date("1772-01-01"), as.Date("2024-10-07"), "day"),
         year  = as.numeric(format(date, "%Y")),
         month = as.numeric(format(date, "%m")),
         doy   = as.numeric(format(date, "%j")))

marsham %<>%
  left_join(temp %>% 
              filter(month >= 2,
                     month <= 4) %>% 
              group_by(year) %>% 
              summarize(spring.temp = sum(pmax(0, Value)))) %>%
  na.omit()

marsham %>%
  filter(year < 1774 | year > 1956) %>%
  kable(align = "c")
year response.time spring.temp
1772 134 398.0
1773 113 541.5
1957 104 701.3
1958 122 472.4

data from Marsham (1746-1958)

Observations (Marsham 1746-1958)

expand for R code
(figure_marsham <-
    marsham %>%
    ggplot() +
    aes(spring.temp, response.time) +
    geom_point(alpha = .5) +
    labs(x = "Cumulative Daily Temperature (°C, 2/1 to 4/30)",
         y = "Time until Budburst (number of days after January 1)") +
    theme_bw() +
    ylim(70, 180))

data from Marsham (1746-1958)

Fit using lm/glm

expand R code
figure_marsham +
  geom_smooth(aes(weight = spring.temp^3), 
              method = "lm", formula = y ~ I(1/x), color = "red") +
  geom_smooth(method = "glm", formula = y ~ I(1/x),
              method.args = list(family = inverse.gaussian(link = "identity")))

data from Marsham (1746-1958)

Fit using Stan

expand for Stan code
data {
  int<lower=1> n;
  int<lower=1> m_lower;
  int<lower=m_lower> m_upper;
  vector<lower=0>[n] mu;
  vector<lower=0>[n] y;
}
parameters {
  real<lower=0> gamma;
  real<lower=0> sigma;
  real delta;
  real<lower=0> tau;
}
model {
  y ~ normal(gamma / mu + delta, gamma * pow(sigma, 2) / pow(mu, 3) + tau);
}
generated quantities {
  vector[m_upper - m_lower + 1] y_pred;
  for(i in m_lower:m_upper) {
    y_pred[i - m_lower + 1] = gamma / i + delta + normal_rng(0, tau);
  }
}
expand for R code
# stan_mod2 <- stan_model("stan_code/stan_mod2.stan")

fit2 <- 
  sampling(stan_mod2, 
           data = list(n = nrow(marsham),
                       m_lower = 300,
                       m_upper = 800,
                       mu = marsham$spring.temp,
                       y = marsham$response.time),
           control = list(adapt_delta = 0.95),
           refresh = 0)

figure_marsham +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), fill = "grey", alpha = .5,
              data = 
                tibble(spring.temp = 300:800,
                       response.time =  colMeans(extract(fit2, "y_pred")[[1]]),
                       ymin = apply(extract(fit2, "y_pred")[[1]], 2, quantile, prob = .025),
                       ymax = apply(extract(fit2, "y_pred")[[1]], 2, quantile, prob = .975))) +
  geom_line(data = 
              tibble(spring.temp = 300:800,
                     response.time =  colMeans(extract(fit2, "y_pred")[[1]]),),
            color = "orange", linewidth = 1)

data from Marsham (1746-1958)

Non-asymptotic model

One last takeaway from CLT

  • In experimental setting, day forcing begins known by design.
    • For observational, day forcing begins not known.
    • Cumulative temperature from 2/1 to 4/30 is a proxy for total force. Proxy is accurate when \(\gamma\) large. Why?

  • Recall \(n_{\gamma} \ \dot \sim \ \text{Normal} \left ( \gamma / \mu + a, \, \gamma \sigma^2 / \mu^3 \right)\) when \(\gamma\) is large.
    • When \(\gamma\) is large, \(n_{\gamma}\) is large, and \(n\mu \approx \sum_{i=a}^n \mu_i \approx \sum_{i=b}^n \mu_i\) for any \(a,b << n\). i.e., First few \(\mu_i\) don’t matter.
    • Argument fails when \(\gamma\) is not large.

Non-asymptotic model

  • Assume instead \(X_i > 0\) is normal with mean \(\mu_i\) and variance \(\sigma^2\).

  • Note that \(\mathbb{P} \left (n_{\gamma} \leq m \right ) = \mathbb{P} \left (S_a^m > \gamma \right) = \Phi \left( \frac{\sum_{i=a}^m \mu_i - \gamma}{ \sqrt{m - a} \, \sigma} \right)\) so that the likelihood contribution of each observation is \[\mathcal{L} \left (\gamma, \sigma, a ; \, n_{\gamma}, \{\mu_i\}\right ) = \Phi \left( \frac{\sum_{i=a}^{n_{\gamma}} \mu_i - \gamma}{\sqrt{n_{\gamma}-a} \, \sigma} \right ) - \Phi \left( \frac{\sum_{i=a}^{n_{\gamma}-1} \mu_i - \gamma}{\sqrt{n_{\gamma}-1-a} \, \sigma} \right )\]

  • n.b. \(\sum_{i=a}^{n_{\gamma}} \mu_i = \sum_{i=1}^{n_{\gamma}} \mu_i \, 1_{i \geq a}\) and

\(1_{i \geq a} \approx (1+\exp(-b(i - a)))^{-1}\) when \(b\) is large.

Experiment (Charrier et al. 2011)

expand for Stan code
data {
  int<lower=1> n;
  int<lower=1> m_lower;
  int<lower=m_lower> m_upper;
  vector<lower=0>[n] mu;
  vector<lower=0>[n] y;
}
parameters {
  real<lower=0> gamma;
  real<lower=0> sigma;
  real delta;
}
model {
  for(i in 1:n) {
    target += 
    log(
      Phi(((y[i] - delta) * mu[i] - gamma) / sqrt((y[i] - delta) * sigma)) - 
      Phi(((y[i] - delta - 1) * mu[i] - gamma) / sqrt((y[i] - delta - 1) * sigma))
    );
  }
}
generated quantities {
  vector[m_upper - m_lower + 1] y_pred;
  for(i in m_lower:m_upper) {
    y_pred[i - m_lower + 1] = gamma / i + delta;
  }
}
expand for R code (Experimental)
# stan_mod3 <- stan_model("stan_code/stan_mod3.stan")

fit3 <- 
  sampling(stan_mod3,
           data = list(n = nrow(charrier11),
                       m_lower = 5,
                       m_upper = 25,
                       mu = charrier11$forceday,
                       y = charrier11$response.time),
           control = list(adapt_delta = 0.95),
           refresh = 0,
           init = rep(list(list(sigma = 1e4)), 4))

figure_charrier +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), fill = "grey", alpha = .5,
              data = 
                tibble(forceday = 5:25,
                       response.time =  colMeans(extract(fit3, "y_pred")[[1]]),
                       ymin = apply(extract(fit3, "y_pred")[[1]], 2, quantile, prob = .025),
                       ymax = apply(extract(fit3, "y_pred")[[1]], 2, quantile, prob = .975))) +
  geom_line(data = 
              tibble(forceday = 5:25,
                     response.time =  colMeans(extract(fit3, "y_pred")[[1]]),),
            color = "orange", linewidth = 1)

data from Charrier et al. (2011)

Observational (Marsham 1746-1958)

expand for Stan code
data {
  int<lower=1> n;
  int<lower=1> m_lower;
  int<lower=m_lower> m_upper;
  matrix<lower=0>[n, 366] temp;
  vector<lower=0>[n] y;
}
parameters {
  real<lower=0> gamma;
  real<lower=0> sigma;
  real mu_a;
  real<lower=0> sigma_a;
  vector<lower=1, upper=80>[n] a;
}
model {
  vector[n] mu_sum;
  matrix[n, 366] w;
  
  for(i in 1:n) {
    for(j in 1:366) {
      if(j < y[i]) {
        w[i,j] = inv_logit(10 * (j - a[i]));
      } else {
        w[i,j] = 0;
      }
    }
    
  mu_sum[i] = dot_product(w[i], temp[i]);  
  target += 
    log(
      Phi((mu_sum[i] - gamma) / 
        sqrt((y[i] - a[i]) * sigma)) - 
      Phi((mu_sum[i] - temp[i, to_int(floor(y[i]))] - gamma) / 
        sqrt((y[i] - a[i] - 1) * sigma))
        );
  }
  a ~ normal(mu_a, sigma_a);
}
generated quantities {
  vector[m_upper - m_lower + 1] y_pred;
  for(i in m_lower:m_upper) {
    y_pred[i - m_lower + 1] = 90 * gamma / i + normal_rng(mu_a, sigma_a);
  }
}
expand for R code (Observational)
# stan_mod4 <- stan_model("stan_code/stan_mod4.stan")

temp_matrix <-
  temp %>%
  filter(year %in% marsham$year) %>%
  transmute(doy, year, temp = pmax(0, Value)) %>%
  pivot_wider(names_from = doy, 
              values_from = temp,
              values_fill = 0) %>%
  select(-year) %>%
  as.matrix() 

fit4 <- 
  sampling(stan_mod4, 
           data = list(n = nrow(marsham),
                   m_lower = 300,
                   m_upper = 800,
                   temp = temp_matrix,
                   y = marsham$response.time),
       control = list(adapt_delta = 0.99),
       refresh = 0,
       init = rep(list(list(sigma = 1e6)), 4))

figure_marsham +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), fill = "grey", alpha = .5,
              data = 
                tibble(spring.temp = 300:800,
                       response.time =  colMeans(extract(fit4, "y_pred")[[1]]),
                       ymin = apply(extract(fit4, "y_pred")[[1]], 2, quantile, prob = .025),
                       ymax = apply(extract(fit4, "y_pred")[[1]], 2, quantile, prob = .975))) +
  geom_line(data = 
              tibble(spring.temp = 300:800,
                     response.time =  colMeans(extract(fit4, "y_pred")[[1]])),
            color = "orange", linewidth = 1)

data from Marsham (1746-1958)

Conclusion

Model comparison

expand for R code
tibble(data = c(rep("Charrier", 2), rep("Marsham", 2)), 
       model = rep(c("asymptotic", "non-asymptotic"), 2),
       gamma = c(mean(extract(fit1, "gamma")[[1]]),
           mean(extract(fit3, "gamma")[[1]]),
                mean(extract(fit2, "gamma")[[1]]) / 90,
                mean(extract(fit4, "gamma")[[1]])),
       lower = c(quantile(extract(fit1, "gamma")[[1]], p = .025),
                 quantile(extract(fit3, "gamma")[[1]], p = .025),
                 quantile(extract(fit2, "gamma")[[1]], p = .025) / 90,
                 quantile(extract(fit4, "gamma")[[1]], p = .025)),
       upper = c(quantile(extract(fit1, "gamma")[[1]], p = .975),
                 quantile(extract(fit3, "gamma")[[1]], p = .975),
                 quantile(extract(fit2, "gamma")[[1]], p = .975) / 90,
                 quantile(extract(fit4, "gamma")[[1]], p = .975))) %>%
  kable(align = "c", 
        digits = 0,
        col.names = c("data", 
                     "model", 
                     "$\\widehat{\\gamma}$",
                     "$\\widehat{\\gamma}_{\\text{lower}}$",
                     "$\\widehat{\\gamma}_{\\text{upper}}$"))
data model \(\widehat{\gamma}\) \(\widehat{\gamma}_{\text{lower}}\) \(\widehat{\gamma}_{\text{upper}}\)
Charrier asymptotic 769 467 1069
Charrier non-asymptotic 992 806 1186
Marsham asymptotic 247 213 282
Marsham non-asymptotic 383 343 417

Summary

  1. We justified modeling biological process as stopped random walks.
  2. We reviewed the CLT for stopped random walks.
  3. We applied the CLT to experimental and observational data.

  • Found CLT approximation compared well to non-asymptotic model.
    • The model can be complicated to allow for more covariates or additional variation.

References

  1. Auerbach, Jonathan. (2023). A demonstration of the law of the flowering plants. Real World Data Science. https://realworlddatascience.net/ideas/tutorials/posts/2023/04/13/flowers.html

  2. Charrier, G., Bonhomme, M., Lacointe, A., & Améglio, T. (2011). Are budburst dates, dormancy and cold acclimation in walnut trees (Juglans regia L.) under mainly genotypic or environmental control?. International journal of biometeorology, 55(6), 763-774. https://pubmed.ncbi.nlm.nih.gov/21805380/

  3. Marsham, R. (1789). XIII. Indications of spring, observed by Robert Marsham, Esquire, FRS of Stratton in Norfolk. Latitude 52° 45’. Philosophical Transactions of the Royal Society of London, (79), 154-156.